ABSTRACT

Results of the International English Language Testing
System (IELTS) battery trials in Australia are reported. The IELTS
tests of productive language skills use direct assessment strategies
and subjective scoring according to detailed guidelines. The
receptive skills tests use indirect assessment strategies and
clerical scoring procedures. Component tests in reading, writing,
listening, speaking, and grammar and vocabulary were developed by
international teams for use in measuring English language competence
and identifying suitable candidates for study in
English-language-medium programs. The report describes the trial
subject sample and test component characteristics, and presents and
discusses detailed statistical results for each test item,
reliability statistics, and data on inter-test correlations and
interrater reliability. The grammar and vocabulary component was
removed from the test, and some item deletions are noted. A brief
list of references is supplied. (MSE)
The Nature of The Test Battery

In a paper on the IELTS development presented at the fifth ALAA conference in Launceston, the
structure, nature and procedures adopted in developing and trialing the test components of the International
English Language Testing System battery were described (Griffin, 1988). This paper addresses the
results of the trials of the test battery using the data collected in the Australian component of the trials.
The testing system focuses on both productive and receptive skills. The tests of productive skills employ
direct assessment strategies and use subjective marking procedures guided by detailed guidelines and
training of assessors. The tests of receptive skills employ indirect assessment strategies and clerical
marking procedures. Conversion to final band scores is based on the judgements of the test developers,
which were informed by knowledge of the candidates, the skills assessed by the tasks set and the
directions given to test item writers. Several workshops were used to develop training methods, criteria
and rating protocols for the productive skills of speaking and writing. Assessments of these skills were
interpreted as being at one of ten levels or bands as described in the specifications of the tests. These
were labelled from band 0 to band 9. Band 0 indicated no proficiency or a failure to take the test and band
9 indicated the highest level of language proficiency, roughly equivalent to a native-like proficiency.
This did not presume, however, that native speakers would always score at the highest levels.

Direct interpretation of the receptive skills was not possible. Indirect assessment, based on paper and
pencil tests, was used. The total test scores were then used as necessary information to estimate the
band level of these skills. Definitions of the band levels were included in the specifications of the tests.
The reading and writing tests were designed with specific academic populations in mind. A series of
specifications for special purpose modules focussed on sub populations in academic fields including
Science and Technology, Arts and Social Sciences and Life and Medical Sciences. A further set of
specifications was developed to cater for what was described as a non academic, general training
population. The reading and writing tests for each special population were contained within the same test
booklet but were ordered such that all reading tasks were completed before writing tasks could be
attempted.

The component tests were developed from specifications written by teams from Australia, Canada and the
United Kingdom. The battery of tests was designed to measure English language competence and to
identify suitable candidates for study in programs conducted in an English language medium. Five tests
were originally included in the battery of tests which an individual candidate could expect to take. These
were:

1. Reading
2. Writing
3. Listening
4. Speaking
5. Grammar and Lexis

The fifth test, that of grammar and lexis, has now been omitted from the test battery.



Table 1
IELTS Battery Composition

Component            Code   Population focus
Grammar and Lexis    G1     General
Listening            G2     General
Speaking             G3     General
Reading              M1     Science and Technology
Reading              M2     Arts and Social Sciences
Reading              M3     Life and Medical Sciences
Reading              M4     General Training
Writing              M1     Science and Technology
Writing              M2     Arts and Social Sciences
Writing              M3     Life and Medical Sciences
Writing              M4     General Training



The tests were administered throughout Australia and South East Asia by Australia's International
Development Program of the Universities and Colleges (IDP). British Council representatives also trialed
the test in non English speaking countries. The overall coordination of the trials of the test was conducted
at the University of Lancaster by the IELTS project team. This report focuses on the data gathered by
the Australian contribution to the trial forms of the IELTS. The schedule of the IELTS trials is presented
in the following table.

Table 2
The Schedule of Testing in the IELTS Trials

Code   Component            Items   Time (mins)
G1     Grammar and Lexis    38      30
G2     Listening            41      30
G3     Speaking             n/a     15
M1     Reading              38      50
M1     Writing              I       40
M2     Reading              39      50
M2     Writing              2       40
M3     Reading              33      50
M3     Writing              2       40
M4     Reading              42      50
M4     Writing              3       40



All tests were group administered except the test of Speaking. This was of an interview format and was
individually administered. The schedule kept the total testing time at 110 minutes and allowed the full
group testing battery to be administered in one sitting. Not all candidates in the trials were asked to
complete the full battery. The purpose of the trials was to establish the properties of the components and
to establish a basis for future reliability and validity studies.

The Trial Samples 

Trial testing, under the direction of the Australian office of the IDP, took place in four countries:
Indonesia, Thailand, Hong Kong and Australia. In Hong Kong and Australia, native speakers were
assessed. Table 3 presents the number of candidates assessed on each test in each of the countries from
which samples were drawn.



Table 3
Sample Sizes for Each Component Test of IELTS and Place of Administration

                                 Test
Country      G1     G2     M1     M2     M3     M4     M5     Total
Hong Kong    482    463    261    105    113    121    0      1547
Indonesia    lOS    106    77     67     73     69     0      597
Thailand     45     47     8      10     8             0      139
Australia    201    131    270    257    283    381    124    1647
Total        843    749    616    439    477    592    124    3930



Test Characteristics: General 

A difficulty presents itself in a presentation of results about the development and trialing of the IELTS.
Because of the security of the test, it is not possible to illustrate data using examples of test items. The
data on each test were analysed to provide item and total means, reliability and point biserial correlation
coefficients for each item. Candidates were also asked to rate themselves on a nine point scale to gain
a self assessment estimate of their band scale. This estimate is presented in the tables as SELF. In each
test some additional questions were asked of the students. These were used for feedback to the test
developers, and the means, standard deviations and correlations with the test total score are also reported
in these analyses. The questions were:

FB1 Do you feel that this was a fair test of your English?

FB2 Was there enough time for you to complete the test?

FB3 Was the test too hard?

FB4 Was the test too easy?

FB5 Were the questions realistic?

FB6 Were the instructions clear?

Item FB5 was not asked for the Grammar and Lexis test. Two tables and a figure are presented for each
test in the IELTS battery. The first table presents the following information for the General Training
Module. This paper presents the results of the analysis of this module. Other test module results will
become available as the manuals are released by the management of the IELTS project, and general data
from the modules based on the Australian data were presented by Griffin (1989). The general results will
encompass both the UK data and the Australian data and may not be identical to the results presented in
this paper. Large differences would not be expected, however. The table below presents the general
characteristics for the IELTS trials without presenting the specific item level data.

Table 4
General Characteristics of Modules in IELTS

                                              P            phi          Rasch diff
Module   N     Items   Mean   SD    Alpha   max   min    max   min    max     min     itemriS
G1       843   38      26     6.4   82      979   230    626   114    -3.31   2.73    1368
G2       749   41      23.7   7.5   83      955   116    628   044    -2.78   2.88    1270
ASS      616   38      17.3   8.9   90      787   116    654   204    -1.83   2.13    1950
LMS      439   39      15.8   9.4   92      758   075    690   287    -3.06   2.47    1853
ST       477   33      14.9   7.9   90      790   073    686   307    -1.36   2.96    1458
GT       592   42      25.2   6.7   80      934   212    547   145    -2.15   2.01    880



The above data illustrate the consistency across modules. They are of uniformly high reliability, have a
wide range of item difficulty and discrimination and have suitable levels of fit to an underlying dimension
as estimated by the proportion of items which fit the Rasch model. In addition to the test level data above,
specific item level data were collected on the feedback items.

(i) The feedback from the candidates regarding the suitability of the test for their purposes and the
candidates' perception of the fairness of content, time available, clarity of instructions and ease
or difficulty of the instrument. Where both reading and writing are presented, the same items are
asked for each skill. The feedback items were based on a dichotomous response scored "1" for
"Yes" and "0" for "No". So the higher the value, the greater the satisfaction of the candidate.

(ii) Estimates of internal consistency coefficients of reliability (alpha), the number of cases providing
data for the test and the overall average score on the test.

(iii) Standard deviations and point biserial correlations for each item are also presented.

Table 5

General Training Reading and Writing Test:
General Properties and Student Feedback

Variable      Mean      SD       r.tot
M4RFB1        1.262     .440     -.045
M4RFB2        1.658     .474     -.374
M4RFB3        1.552     .497     .286
M4RFB4        1.970     .170     .030
M4RFB5        1.157     .364     -.008
M4RFB6        1.131     .338     -.166
M4RSELF       4.587     1.450    .244
M4W1          4.293     1.184    .342
M4W2          4.453     .891     .369
M4W3          4.256     1.037    .301
M4WFB1        1.237     .425     -.069
M4WFB2        1.575     .495     -.264
M4WFB3        1.554     .497     .183
M4WFB4        1.963     .188     .031
M4WFB5        1.124     .330     -.114
M4WFB6        1.100     .300     -.178
M4WSELF       4.025     1.357    .211
M4TOT         25.182    6.715
ALPHA         .845
N OF CASES    592







The second table provides information on each test item. The data provided are the item mean, standard 
deviation and the point biserial coefficient. 
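
The three item statistics named above can be illustrated with a short sketch. It is not the analysis program used in the trials; the function names and the tiny response matrix are hypothetical, and the formulas are the standard classical ones for dichotomously scored (0/1) items.

```python
from statistics import mean, pstdev


def item_analysis(responses):
    """Return (item mean, item SD, point biserial) for each 0/1 item.

    responses: list of candidate rows, each a list of 0/1 item scores.
    The point biserial is taken against the total score (item included).
    """
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    total_sd = pstdev(totals)
    results = []
    for i in range(n_items):
        scores = [row[i] for row in responses]
        p = mean(scores)                  # item mean = proportion correct
        sd = (p * (1 - p)) ** 0.5         # SD of a dichotomous item
        passed = [t for t, s in zip(totals, scores) if s == 1]
        failed = [t for t, s in zip(totals, scores) if s == 0]
        if passed and failed and total_sd > 0:
            r_pb = (mean(passed) - mean(failed)) / total_sd * sd
        else:
            r_pb = float("nan")           # undefined when everyone agrees
        results.append((p, sd, r_pb))
    return results


def coefficient_alpha(responses):
    """Internal consistency (coefficient alpha) for a set of items."""
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    item_vars = [pstdev([row[i] for row in responses]) ** 2 for i in range(k)]
    total_var = pstdev(totals) ** 2
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical data: four candidates, three items.
demo = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
stats = item_analysis(demo)
alpha = coefficient_alpha(demo)
```

For a dichotomous item the standard deviation follows directly from the mean, which is why an item with a mean of .859 shows a standard deviation close to .35.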
Table 6

General Training Test of Reading:

Item      MEAN     S.D.    r.pbi    LOGIT    ERROR    FIT
M4A2      .859     .347    .306     -1.21    .13      .35
M4A3      .917     .275    .184     -1.91    .16      -.04
M4A5      .848     .359    .270     -1.18    .13      .03
M4A6      .800     .399    .291     -0.80    .11      .09
M4A7      .473     .499    .337     0.81     .09      .45
M4A8      .886     .317    .240     -1.51    .14      .05
M4A9      .861     .345    .403     -1.62    .16      -.64
M4A11     .853     .354    .343     -1.17    .13      -.59
M4A12     .658     .474    .370     .07      .10      -.70
M4A13     .304     .460    .357     1.73     .10      -.91
M4A14     .604     .489    .470     .26      .10      -1.86
M4A15     .888     .315    .339     -1.56    .15      -.40
M4A16     .366     .482    .364     1.44     .09      -.99
M4A17     .934     .248    .321     -2.15    .19      -1.12
M4A18     .922     .267    .267     -1.93    .17      -.57
M4A19     .903     .295    .444     -1.87    .18      -1.44
M4A20     .864     .342    .370     -1.25    .19      -.72
M4A21     .841     .365    .356     -1.06    .12      -.66
M4A22     .636     .481    .293     .15      .10      1.34
M4A23     .814     .389    .304     -0.93    .12      .04
M4A24     .542     .498    .142     .d3      .10      6.12
M4A25     .511     .500    .485     .70      .10      -4.45
M4A26     .574     .494    .361     .38      .09      -.23
M4A27     .768     .422    .431     -.74     .12      -.32
M4A28     .613     .487    .480     .11      .10      -1.55
M4A29     .432     .495    .323     .99      .10      2.36
M4A30     .488     .500    .314     .67      .10      3.16
M4A31     .241     .428    .212     1.99     .11      2.50
M4A32     .290     .454    .285     1.70     .10      1.29
M4A33     .694     .461    .465     -1.03    .14      -.82
M4A35     .278     .448    .412     1.68     .10      -1.08
M4A36     .356     .479    .533     1.24     .10      -3.67
M4A37     .310     .463    .205     1.52     .11      4.26
M4A38     .212     .409    .274     2.01     .11      1.25
M4A39     .584     .493    .540     -.01     .11      -1.73
M4A40     .572     .495    .493     -.06     .11      .20
M4A41     .295     .456    .426     1.48     .10      -1.03
M4A42     .456     .498    .547     .43      .10      -1.95
M4A43     .234     .424    .344     1.70     .11      1.22
M4A44     .413     .492    .486     .67      .11      -.19
M4A45     .469     .499    .533     .21      .11      -.98
M4A46     .599     .490    .495     -.71     .18      .00

Mean      24.68    6.73
Alpha     0.79













The general training module has a wide range of difficulty. From the table and the figure, it is evident
that the test caters for the suitable range of candidates and discriminates at the appropriate levels. Not
all items fit the latent trait scale. Seven of the 42 items do not appear to be measuring the same dimension
of language as the other items. However, the remaining 35 items are, according to their fit to the latent
trait, acting together to assess the language ability of the candidates. This is despite the fact that the candidate
group was obtained from a wide range of backgrounds, first languages and prospective courses. The test
appears to have sound construct validity. In an earlier study of reading tests using item response theory
as a guide to construct validity, Andrich and Godfrey (1978) analysed Davis' test of reading
comprehension. Their analysis argued that 80 percent of the items fitting the underlying trait gave
sufficient evidence of construct validity. In this case, the percentage is 83.3 percent (35 of 42 items). Hence the majority
of items in the test are measuring the same construct. Construct validity would appear to have been
demonstrated. The items which do not fit the underlying trait were also examined. Each involves the
elimination of negative options or the elimination of distracting information. The block of items which
contained most of these difficulties was eliminated from the final form of the test.

The test was clearly not difficult overall. Apart from one set of items, M4A31 to M4A38, the items have
high mean scores. The more difficult items have now been removed from the test as well, largely
because of the types of tasks used in the items. Hence the overall difficulty of the test has been reduced
somewhat after the trials and the expected mean scores will rise.

The Figure below illustrates the distribution of the scores of the students relative to the distribution of the
difficulty levels of the items on the test. Where the student distribution appears to be above the item
distribution, it appears that the test may be too easy for the candidates as a whole group. This information
needs to be taken into account when interpreting the feedback item information.

There are three scales in the figure. The first is the raw score of the students. The second is the latent
trait logit scale and the third is the band scale for interpretation of the IELTS. This scale is an interval
scale, based on the interval properties of the latent trait, and is a linear transformation of the latent trait
logit scale. The logit scale is derived from the application of the simple logistic model of the Rasch latent
trait theory. It is computed from the equation

$$\ln\left(\frac{P}{1 - P}\right) = \beta - \delta$$

where P is the probability of a correct response, \beta is the ability of the candidate and \delta is the
difficulty of the item.

Knowledge of the characteristics of the student groups and identification of native speakers and their test
performances were used to establish these levels. Like the assessment of the productive skills of speaking
and writing, a professional judgement is ultimately required to transform the raw test scores of the
receptive skills onto the band scales used for reporting to consumers of IELTS information.
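
The relationship between the three scales can be sketched in a few lines. This is only an illustration of the simple logistic model in its standard form; the function names are hypothetical, and the slope and intercept of the linear transformation to bands are placeholder values, not the constants used for the IELTS, which rest on the professional judgement described above.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the simple logistic model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def to_band(logit_value, slope=1.0, intercept=5.0):
    """Linear transformation of a logit onto a reporting scale.

    The slope and intercept are hypothetical placeholders.
    """
    return slope * logit_value + intercept


# A candidate whose ability equals the item difficulty succeeds half the time.
p_equal = rasch_probability(0.0, 0.0)
# Taking the logit of the model probability recovers ability minus difficulty.
recovered = logit(rasch_probability(1.2, -0.3))
```

Because the band scale is a linear function of the logit, equal logit differences correspond to equal band differences, which is the interval property referred to in the text.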
Figure 1

General Training Test of Reading: Conversion from Raw Score to Band Levels.
Correlations among the different modules of the IELTS were obtained, as were correlations of the
IELTS battery tests with other criterion measures. Existing records were used to obtain scores from the
Hong Kong Examinations Authority for their listening test, the overall GCE grade in English, a summary
score, a comprehension score and a compositional score. This enabled correlations to be obtained against
all other scores. Where available, scores on the TOEFL, the Short Selection Test (SST), the ASLPR
(ASLR and ASLW for reading and writing estimates), the existing ELTS and the Oxford tests, forms 2
and 3, forms A and B, were obtained (02A, 02B, 03A, 03B). Self assessment was also gathered in that
the students were asked to place themselves on a 9 point scale, but without any guidance as to the meaning
of levels. These are labelled SPR and SPW for self proficiency in Reading and Writing. Nevertheless,
these scores enabled further insight into the behaviour of the IELTS battery against a range of other
measures. Table 7 below presents the correlations of the IELTS battery with the criterion measures.
Most of the emphasis is placed on the general training module, as with the rest of the paper, and other
criterion correlations will be made available as the manuals and other papers become available from the
IELTS management. No correlations between the speaking test and other measures were obtained during
the Australian trials.
CORRELATIONS IELTS 1989





G2 


MIR 


MlW 


M2R 


M2W 


M3R 


M3W 


M4R 


H4W 




M4R 


772 


588 




593 




430 








MAD 


N 


242 


71 




68 




48 








n 


M4W 


47S 










256 


577 


449 




M4W 


N 


201 










16 


7 


222 




M 


ELTS 


826 


258 


388 


448 


712 


203 


273 


9 A *• 


■t ■» w 


FT TC 


N 


11 


23 


22 


12 


12 




Q 


o 


Q 


M 

n 


TOEFL 


804 


879 


678 


704 


S69 


866 

www 


W X 7 


w ■» f 


702 

/ w^ 




N 


66 


15 


16 


18 


19 


21 


21 


A 

w 


6 

w 


M 

n 


SST 


-7S3 






*269 


• 535 




/ WW 


w 7 w 




C CT 


N 


39 






24 


24 




Q 


?7 


2 1 


M 

n 


02A 


492 


















n2 A 


N 


136 


















M 

41 


03A 


510 




















N 


54 


















M 

n 


HKGRADE 




-602 


-614 


-446 


-460 


^^416 


-411 


-297 

*• 7 / 


-504 


^ A w 


HKGRADE 






















N 


218 


60 


60 


48 


48 


W 4t 


w^ 




2Q 

& 7 


n 


HKSUMRY 






638 


402 


441 


507 


419 


314 


www 


n 
\j 


HKSUMRY 






















N 




60 


60 


48 


48 


63 


63 

W «J 


30 


29 

A 7 


M 


HKLIST 




484 


















HKLISTN 






















N 


218 


















M 

41 


HKCOMPOS 






531 


407 


248 


372 


117 


282 


464 

^ w ^ 


0 

w 


HKCOMPOS 






















N 




60 


60 


68 


48 


63 


63 


30 


29 


N 


SPR 


406 


404 


508 


472 


562 


363 


384 


254 


192 


SPR 


N 


402 


225 


231 


98 


87 


104 


104 


342 


177 


N 


SPW 




351 


460 


475 


520 






149 


235 


SPW 


N 




189 


189 


94 


93 






219 


145 


N 



While many of the sample sizes are small, the correlations are encouraging. Moderately high and
appropriately signed correlations have been obtained for all modules with the TOEFL, the SST, the
Oxford tests and the Hong Kong GCE Examination results. Too few cases were obtained to make any
interpretation of the ASLPR ratings. This, however, should be easy for IELTS Australia to remedy in
the future. The evidence is encouraging for the IELTS battery in terms of criterion validity. It is clear
that the IELTS is measuring language proficiency in the same domain measured by similar test batteries.

The correlations between the reading tests in the modules are also generally high, indicating that the tests
are generally measuring the same underlying variable. This has been further explored by Alderson (1990)
in his comparison of the Australian data with the combined UK and Australian data. The intercorrelations
among the reading modules are presented in Table 8 below.
Table 8

Intercorrelations Among the Reading Tests (number of cases in parentheses)

            Arts        Sci         Gen Trng    Gram        List
Life Med    58 (90)     65 (114)    59 (68)     S8 (88)     66 (88)
Arts                    47 (100)    58 (74)     69 (198)    62 (198)
Sci Tech                            49 (60)     80 (123)    79 (123)
Gen Trng                                        78          77 (123)
Gramm                                                       79 (123)

Two things are noticeable: first, the generally low correlations of the general training module with the
other reading tests, and second, the generally high correlations of the grammar test with all other tests.
Alderson also illustrates this relationship and classifies the grammar test as a reading test, as is the
listening test.



Reliability: 

Reliability can be assessed from two aspects. First, there are the classical internal consistency reliability
estimates, and second, there are the item level reliability or error estimates available from the latent trait
analyses. Table 6 presents the error estimates and the internal consistency estimate of 0.79. The latent
trait analyses illustrate the high item level reliability, given that few items exceed errors of 0.20. These
figures illustrate the reliability of the reading tests in the IELTS battery and in particular the reliability
of the General Training module. Reliability estimates assisted in the decision to remove the grammar and
lexis test from the test battery.

The test of lexis and grammar was omitted from the IELTS battery after examination of reliabilities and
after examination of issues underpinning the test. The four remaining tests all assess either a productive
or receptive language skill. The test of grammar and lexis tested knowledge about language rather than
the ability to use it for communicative purposes. In addition, there was no suitable scale of progression
which could be developed for interpretation and reporting as with the other tests. While professional
judgement is ultimately needed for reporting the levels of attainment on the reading and listening tests in
terms of IELTS band levels, no similar translation could be provided for the test of lexis and grammar.
These substantive reasons, together with the lack of contribution to reliability beyond that which could be
achieved by increasing the number of items in the reading test, helped the management of IELTS
to decide to recommend its omission from the battery. The table below illustrates the contribution of the
lexis and grammar test to the overall battery of clerically scored tests and the overall reliability of the
combined tests with the conflated module. In all cases it can be seen that the addition of G1 to the battery
produced small gains in reliability that could have been achieved with additional items on the reading tests.
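
The claim that comparable gains could come from simply adding reading items can be illustrated with the Spearman-Brown prophecy formula. This is a sketch under stated assumptions rather than the calculation the test management made: the formula assumes the added items are parallel to the existing ones, the function name is hypothetical, and the example pairs the General Training alpha of 0.79 on 42 items with a hypothetical extension by 38 items.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by length_factor.

    Assumes the added items are parallel to the existing ones.
    """
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


# Hypothetical example: a 42-item test with alpha 0.79 extended to 80 items.
predicted = spearman_brown(0.79, 80 / 42)
```

Under these assumptions the predicted reliability rises to roughly 0.88, which shows how lengthening an existing test can stand in for adding a separate component.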
Table 9

Effect of G1 on Reliability of Objective Battery



QXiH ALONE 



coMPrr^EP a 



mm 



Ml 
M2 
M3 
M4 
M5 



906 
909 
857 
933 
977 



924 
935 
919 
949 
964 



177 

79 

88 

240 

41 



117 
117 
111 
117 
122 



A second omission from the final test battery was the conflated version of the test. The fifth module was
constructed as a combination of the academic modules and a separate set of specifications was to be
developed for the module. Despite the administrative gains that were to be had by the development of
a single academic module, the face validity of special purpose modules led the steering committee to omit
the conflated module from the test battery as well.

Probably the most difficult issue to address is the reliability of productive skills in language. Constable
and Andrich (1984) examined the circumstances in which judges are required to assess productive type
skills and are required to give ratings of performances. The usual case in which raters are trained to give
similar ratings was examined, and the paradox of higher correlations among the performances with
constancy of ratings among raters, leading to higher reliability and lower validity, was discussed. The
application of person judgement interaction was recommended and is followed in this
examination of the reliability of the writing scales.



Traditional notions of reliability depend on the degree to which the method of assigning scores eliminates
measurement error. Four potential sources of error have been identified for the assessment of writing.
These are:

(a) The writer (within-subject individual differences)

(b) Variations in task

(c) Between-rater variations

(d) Within-rater variations.

To reduce within-subject error, a pool of similar tasks is often used. However, since essay writing is time
consuming, it is often logistically difficult to have students write several essays under examination
conditions. In the IELTS the largest number of writing tasks set for any candidate is three, in the
General Training module. In all other modules the candidates are asked to write just two essays and there
is a deliberate attempt to vary the nature of the task in order to increase the sample of writing styles. This
is typical of essay examinations, as task structures often differ with variation in topic. Within-subject task
based variation has traditionally been difficult to control. In reducing variability due to task, two parallel
assignments or tasks have often been used. The most prevalent issue associated with writing assessment
reliability is that of inter rater reliability. Statistical indices of agreement include coefficient alpha,
generalisability coefficients, point biserial correlations, and simple percentages of agreement.
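
The simplest of these indices, the percentage of agreement, can be sketched as follows. The function name and the two sets of band ratings are hypothetical; the sketch reports both exact agreement and agreement within one band, a common companion figure that the text itself does not mention.

```python
def agreement_rates(ratings_a, ratings_b):
    """Return (exact, within-one-band) agreement between two raters."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty set of essays")
    pairs = list(zip(ratings_a, ratings_b))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    return exact, adjacent


# Hypothetical band ratings given by two raters to the same five essays.
rater_a = [5, 6, 4, 7, 5]
rater_b = [5, 5, 4, 9, 6]
exact, adjacent = agreement_rates(rater_a, rater_b)
```

Percentages of agreement are easy to read but take no account of agreement expected by chance or of systematic differences in rater severity, which is part of the motivation for the latent trait treatment that follows.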

The most effective method found to reduce variation between raters is to provide training on specified
criteria. Control of within-rater variability over time involves the use of periodic checks and common
reference standards such as exemplar essays. However, in assessing raters as well as the ratings for
reliability, it may be useful to examine the stability of individual ratings and of tasks in terms of the
attribute being assessed.



The traditional definition of reliability from the classical or "true score" model is the proportion of
variance that is due to the sample's true score variance. It depends on the average error variance of the
test, which arises from a variety of sources. Reliability is often estimated by calculating the correlation
between repeated measures of the same entity such as an essay. However, reliability is a property of a
variable, not the test. It is a property of the measure that is obtained from the test. This can be interpreted
as a line along which objects (in this case essays) can be positioned. The positions on the line need to
be interpreted, so equally spread intervals are required. In the assessment of language these are usually
defined by various descriptions of language behaviour which are placed on a rating scale. In this case the
rating points on the scale form the levels of proficiency used for reporting the assessments. Often the
rating points are assumed or declared rather than defined via empirical methods. One empirical method
for calibrating the units of measurement on the variable is through the application of item response theory
(IRT). This brings together the notion of a person ability (or judgement) and the quality of an item (or
essay) and enables a probabilistic statement about the person's judgement and the essay quality.

The rating assigned to an essay by a judge depends on a number of things. It depends on the quality of
the essay and the dimension of quality that the judge uses. In proficiency assessment, the judge would
be expected to use accepted notions of proficiency to assess the student as exhibited in the sample of
writing in the essay. It depends on the rater's ability to interpret the writing proficiency. This could be
called the rating tendency of the rater and is commonly called the "rater effect".

It is typical of language assessment that the same set of rating points is used by all judges with every
essay. Because of this it is usually considered that the relative proficiency levels associated with the
rating points should not vary from essay to essay. That is, an interpretation of the score of level 1 remains
constant, as do the interpretations of each level on the band scales. This consistency of score
interpretation is usually associated with a fixed scale, in this case called the band scale. For this reason
a Rasch rating scale model has been adopted (Rasch, 1960; Andrich, 1978; Wright and Masters, 1982).

The model is defined by the equation:



where P is the probability of a specific rating being assigned,

x = 1 to m represents the number of steps in the rating scale.

T is the half distance on the variable between rating points and is therefore the threshold
from one rating point to another.

d is the proficiency level for a specific rating point.

B is the rater tendency of the judge.

j is the number of essays judged over the m steps.

In this model successive levels are "recognised" once a threshold is passed, so that d is the essay
competence level and d + T is the threshold at which the judgement changes from a 1 to a 2.
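
How a rating scale model of this kind turns locations and thresholds into rating probabilities can be sketched using the standard form of Andrich's (1978) model. This is an assumption-laden illustration, not the equation or software used in the study: the function and parameter names are hypothetical, and the sign convention (a higher essay level raises the expected rating, a more severe rater lowers it) is assumed rather than taken from the text.

```python
import math


def rating_scale_probabilities(essay_level, rater_severity, thresholds):
    """Return the probabilities of ratings 0..m under a rating scale model.

    thresholds holds the m threshold values shared by all essays and raters.
    Sign convention (assumed): location = essay_level - rater_severity.
    """
    location = essay_level - rater_severity
    # Cumulative sums of (location - threshold); the sum is empty for rating 0.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (location - tau))
    top = max(exponents)                      # subtract the maximum for stability
    weights = [math.exp(e - top) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical three-category scale with thresholds placed symmetrically.
probs = rating_scale_probabilities(0.0, 0.0, [-1.0, 1.0])
# A stronger essay rated by the same judge shifts probability upwards.
stronger = rating_scale_probabilities(2.0, 0.0, [-1.0, 1.0])
```

With the essay located midway between the two thresholds, the middle rating is the most probable and the two outer ratings are equally likely; moving the essay up the variable shifts probability towards the higher ratings.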

The latent trait or variable is defined by the performances on tasks which require increasing amounts of
attribute or proficiency. In this case, however, the tasks are set and the performances vary according to
student proficiency variation. If the trait exists among the judges, then they would sort essays according
to their perception of the amount of trait exhibited in the essays and "levels" along the variable. Sorting
would be based on the amount of writing proficiency. So the group of "expert" judges were asked to sort
essays. If the sort of essay scripts were consistent across judges then a recognisable variable will have
been identified and rater reliability should be high. If the sort were inconsistent across essays and
individual essays were assigned to too great a range of levels, the reliability would be low and no
underlying variable would be identified.

With consistent sorting, the criteria used by the judges can be used to define the nature of the variable and 
would ultimately define the criterion scale. It is possible that the same set of essays could be sorted 
according to a range of criteria, each defining a different underlying variable. Where this is the case 
the "sort" might be erratic and individual essays would not be consistently assigned to levels. Moreover, 
judges would not be consistently ordered with respect to their "rater effect". Under these circumstances, 
the reliability of the variable and the reliability of the judges would be low. 

With these principles of item response theory in mind, a series of workshops was organised in which 
judges would sort essays, articulate their criteria and establish a basis for estimates of both inter- and 
intra-rater reliability. However, the usual approaches to reliability estimation developed through classical item 
analysis are inappropriate and tend to give false information on the definition of the variable and the 
fit of the judges and the essays to the variable. Skehan's (1988; 1989) papers point out the advantages of 
the item response theory approach to reliability estimation. However, there is an added advantage to those 
listed by Skehan in that generalisability theory can also be used, arising from the use of item response 
theory. 

In assessing writing competence, essays are used as samples of work and a homogeneous set of essays can 
be used to define the rating points representing levels on a variable defined as "writing 
competence". This is the first step in investigating the average variation in marking and identifying the 
components due to true score, the extent to which the essays do actually define a variable of writing 
competence and the extent to which raters use specified criteria. Two pieces of information then become 
available. Each essay can be assessed for its deviation from an expected position on the variable and its 
"fit" to the variable, together with the estimates of error used as an estimate of its reliability. That is, 
reliability can focus on the essay at an individual level, and at the individual candidate level. 

Given that essays are used to define the variable, the raters can also be placed along the variable using 
item response theory according to their predisposition for marking high or low on the variable (or placing 
essays in relative locations on the variable). If the variable is also defined for the raters in terms of 
specified criteria or descriptions of writing competence, then the variability among raters can be specified 
in terms of those descriptions. The information obtained from these procedures and the latent trait 
analysis may enable an examination of issues related to the effect of moderation, training and exemplar 
scripts. 

Table 10. 
RATER STATISTICS 

             ASSESSMENT 1               ASSESSMENT 2 
NAME    MEAS   ERROR     FIT       MEAS   ERROR     FIT 
A        .12     .34   -2.73        .16     .22   -4.28 
B        .61     .36   -2.84        .32     .23   -4.39 
C        .32     .31   -2.27        .20     .19   -3.24 
D        .53     .28   -1.73       -.19     .23   -4.36 
E        .53     .34   -2.63        .25     .20   -3.49 
F        .22     .29   -1.84        .17     .15   -1.02 
G        .12     .27   -1.59        .59     .18   -2.84 
H        .64     .21     .27        .73     .15   -1.13 
I        .36     .37   -2.97        .53     .20   -3.34 
J        .53     .30   -2.06        .41     .24   -4.69 
K        .19     .26   -1.38        .43     .30   -6.15 
L        .53     .25   -1.03        .62     .15   -1.10 

For assessment 1, six of the twelve raters appear to "fit" the underlying variable. On occasion 2, however, 
few raters appear to "fit" the variable. There appears to have been a change in the criteria or in the nature 
of the variable being used to assign scripts to levels. The original criteria used in the familiarisation 
workshop and reinforced in the training workshop do not seem to have been used for assessment 2. 
Unfortunately it was assumed that the criteria would remain the same, and they were in fact supplied to the 
raters. One curious point to examine is whether the apparent change in the criteria being used alters the 
rank order of the scripts for assessment 2. This is examined in the analysis of script levels presented 
in Table 11 below. The results suggest that the rank order on the variable and the way in which the scripts 
have been assigned have not changed enough to warrant the rejection of the assigned scores. There is 
clearly a problem with the scoring of scripts in that the raters do not use a common set of criteria, either 
when engaged in moderation or when scoring solo. The training and selection of markers and the 
stability of their ratings have become a focus of the IELTS management, and scorers are required to undergo 
training with regular updates and monitoring to ensure that there is consistency among those chosen and 
those retained. These issues identified in the trials have assisted in developing appropriate training and 
monitoring procedures to ensure the consistency of raters used in marking the essay scripts. 
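
The counts of fitting raters can be checked against the FIT columns of Table 10. The sketch below assumes the conventional cut-off of an absolute standardised fit below 2; that cut-off is not stated in the text and is an assumption, as are the names used in the code.

```python
# FIT values for raters A-L as tabulated in Table 10:
# (assessment 1, assessment 2).
RATER_FIT = {
    "A": (-2.73, -4.28),
    "B": (-2.84, -4.39),
    "C": (-2.27, -3.24),
    "D": (-1.73, -4.36),
    "E": (-2.63, -3.49),
    "F": (-1.84, -1.02),
    "G": (-1.59, -2.84),
    "H": (0.27, -1.13),
    "I": (-2.97, -3.34),
    "J": (-2.06, -4.69),
    "K": (-1.38, -6.15),
    "L": (-1.03, -1.10),
}


def fitting_raters(fit_by_rater, occasion, limit=2.0):
    """Names of raters whose FIT lies strictly inside +/- limit.

    occasion is 0 for assessment 1 and 1 for assessment 2.
    The default limit of 2.0 is an assumed convention, not a value
    given in the report.
    """
    return [name for name, fits in fit_by_rater.items()
            if abs(fits[occasion]) < limit]


print(fitting_raters(RATER_FIT, 0))
print(fitting_raters(RATER_FIT, 1))
```

Under this assumed cut-off the sketch returns six raters for assessment 1 and three for assessment 2, which agrees with the description of six of the twelve raters fitting on the first occasion and few on the second.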

Even in these trials, from a training perspective, there is a noticeable reduction in the variation of rating 
tendency, but the cost is high in terms of the ability of the raters to place the scripts along the variable 
of increasing competence. While the analysis appears to highlight this weakness, it would not be apparent 
under normal or classical analyses. The factor introduced to the assessment was the use of reference 
scripts and a consensus approach to allocation to levels. As can be seen later, the allocation to levels was 
not unanimous and three raters whose scores differed by considerable amounts adhered to their judgements, 
leading to large residuals in the analysis and lack of fit among the raters and for the reference scripts. 
Despite this, there appears to be a maintenance of the range of script scores, a move towards the ideal 
effect of training. That is, the range of ratings for the scripts has been maintained, covering ratings from 
3 to 9, but the range of rating tendencies has been diminished. However, the analysis points out the 
problem of achieving this. There must be changes in the intra-rater scores in order to get this result. 
Hence there has been a loss of inter-rater reliability from the first to the second and third rating occasions. 
Moreover, the high agreement among raters on the second and third ratings means that there is very little 
variance and hence classical reliability estimates will be low. This is in fact the case, as the latent trait 
estimates of person separation indices are low for occasions 2 and 3 (0.40 and 0.39 respectively). The 
item separation indices are high, however, at levels of 0.74 and 0.77 (Wright and Masters, 1982). These 
indices reflect the discussion of Figure 1 and indicate the dilemma of rater studies. Low separation of 
raters needs to be coupled with higher separation of scripts. Hence the item response analysis in rater 
studies needs a very low person separation index and a high item separation index. These results appear 
to suggest that even after training, raters revert to their own criteria when marking solo. The implications 
for method of marking appear obvious. Moderation of non-clerical marking procedures is essential. 
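
The relation between spread of measures, measurement error and a separation index can be sketched as follows. The sketch assumes the usual Rasch separation reliability, that is, the proportion of observed variance in the measures that is not attributable to measurement error. Whether the trial analysis used exactly this form, and the use of the population variance, are assumptions; the input values are hypothetical.

```python
from statistics import mean, pvariance


def separation_reliability(measures, errors):
    """Share of observed variance in the measures not due to error.

    measures -- estimated locations (raters or scripts), in logits
    errors   -- the standard error attached to each measure
    Returns a value between 0 and 1; 0 when error swamps the spread.
    """
    observed = pvariance(measures)
    if observed == 0:
        return 0.0
    error_variance = mean([e * e for e in errors])
    adjusted = max(observed - error_variance, 0.0)
    return adjusted / observed


# Hypothetical raters with very similar tendencies: low separation.
print(separation_reliability([0.1, 0.2, 0.3], [0.3, 0.3, 0.3]))
# Hypothetical scripts spread widely along the variable: high separation.
print(separation_reliability([-2.0, 0.0, 2.0], [0.4, 0.4, 0.4]))
```

The two calls illustrate the pattern the paragraph describes as desirable in a rater study: little separation among raters together with clear separation among scripts.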



While the raters did not appear to be consistent with the application of criteria, the effect on the bands did 
not seem to be as severe. 

Table 11 

Script Assessment Time 1 and Time 2. 

NAME      T1   ERROR     FIT      T2   ERROR     FIT   SHIFT 
M11A    3.23     .35   -1.97    3.38     .76   -1.11    0.32 
M11B    4.33     .35    -.80    4.26     .78     .19    0.10 
M11C    1.96     .30    -.74    1.62     .41   -1.61   -0.17 
M11D    2.62     .31   -1.45    2.64     .77    -.83    0.19 
M11E    1.69     .42   -1.83    1.74     .43   -1.86    0.22 
M12A    2.88     .32   -1.64    3.12     .35   -2.52    0.41 
M12B    3.18     .26    -.67    2.68     .70   -1.14   -0.33 
M12C    2.07     .33   -1.32    2.01     .35   -1.25    0.11 
M12D    2.84     .41   -2.70    3.25     .71   -5.29    0.58 
M12E    1.79            1.77    1.68           -1.28    0.06 
M21A    2.36     .42   -2.47    2.36     .35   -1.56    0.17 
M21B    3.86     .30     .99    3.76     .25    -.40    0.07 
M21C    1.36     .33    -.70    1.16     .32    -.35   -0.03 
M21D    2.40     .41   -2.46    2.07     .30    -.58   -0.16 
M21E    2.62     .53   -3.52    2.50     .36   -1.83    0.05 
M22A    2.54     .34   -1.79    2.59     .54   -3.38    0.22 
M22B    3.49     .36   -1.75    3.76     .51   -3.64    0.43 
M31E    1.56     .28     .32    1.50     .29    -.00    0.11 
M43C    4.06     .37    1.23    4.78    1.09   -2.60    0.89 

Shifts in the values assigned to scripts were examined using common item equating methods. Mean item 
measures for each occasion were used to compute the link shift for the items (0.17). In the table only 
adjusted "attribute" values are shown. Three scripts changed from "non-fit" to "fit" on the second 
assessment and three scripts reversed this. All others in the link set were found to "fit" the writing 
proficiency variable. While the raters have unstable "fit" characteristics, this may have been influenced 
by the new scripts marked on the second occasion. It does not seem to have influenced the ranking of 
scripts from the initial assessment. 
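
The equating step can be illustrated with a few rows of Table 11. The sketch below assumes that the link shift is the difference between the occasion means over the common scripts, and that the adjusted occasion-2 values are the raw values minus that shift. Both the direction of the adjustment and the reading of the final table column as the unadjusted per-script difference are inferred from the tabulated numbers; they are not stated in the text, and the function names are hypothetical.

```python
from statistics import mean


def link_shift(t1, t2_raw):
    """Difference between occasion means over the common (link) scripts."""
    return mean(t2_raw) - mean(t1)


def adjust(t2_raw, shift):
    """Place occasion-2 measures on the occasion-1 scale (assumed direction)."""
    return [value - shift for value in t2_raw]


# Rows M21A-M21E of Table 11: T1, adjusted T2, and the final column.
T1 = [2.36, 3.86, 1.36, 2.40, 2.62]
T2_ADJUSTED = [2.36, 3.76, 1.16, 2.07, 2.50]
FINAL_COLUMN = [0.17, 0.07, -0.03, -0.16, 0.05]
REPORTED_SHIFT = 0.17  # link shift quoted in the text

# Undo the assumed adjustment, then difference each script across occasions.
t2_raw = [value + REPORTED_SHIFT for value in T2_ADJUSTED]
per_script = [round(after - before, 2) for before, after in zip(T1, t2_raw)]

print(per_script)    # compare with FINAL_COLUMN
print(FINAL_COLUMN)
```

For these five rows the recomputed differences agree with the final column of the table. The reported link shift of 0.17 is based on the full link set, so applying `link_shift` to these five rows alone would not reproduce it.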

It is noticeable that scripts with high fit statistics also have the largest translation shifts associated with the 
equating across occasions (T1 to T2). This indicates that these scripts have characteristics which tend to 
confuse the ratings and introduce secondary characteristics not included in the criterion scales. However, 
the size of the fit statistics is expected to be large, given that there were only 15 raters on each occasion 
and 43 scripts on occasion 1 and 20 scripts on occasion 2. The effects of training should be observable 
in the consistency of the ratings as discussed above. Probably the most telling information is the change 
in the "fit" statistic. The test used is commonly called the Infit statistic, which applies a chi-squared-like 
test to residuals. The test is sensitive to outliers. Hence raters whose judgement differs 
considerably from others will have an enhanced effect. This is mostly the case with scripts 
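
How a residual-based fit statistic of this kind is formed can be sketched as follows. The example computes an information-weighted mean square (the sum of squared residuals divided by the sum of model variances) for one script rated by several raters, assuming the Andrich rating scale form with ratings scored from 0. All parameter values are hypothetical. The FIT columns in the tables appear to be standardised values rather than raw mean squares, and that transformation is not shown here.

```python
import math


def category_probs(B, d, thresholds):
    """Rating probabilities under an assumed Andrich rating scale model."""
    logits = [0.0]
    for T in thresholds:
        logits.append(logits[-1] + (B - (d + T)))
    peak = max(logits)
    weights = [math.exp(v - peak) for v in logits]
    total = sum(weights)
    return [w / total for w in weights]


def infit_mean_square(observed, rater_tendencies, d, thresholds):
    """Information-weighted mean square for one script.

    observed         -- rating given by each rater (0..m)
    rater_tendencies -- B for each rater, in the same order
    d, thresholds    -- script location and step thresholds
    """
    squared_residuals = 0.0
    model_variance = 0.0
    for x, B in zip(observed, rater_tendencies):
        p = category_probs(B, d, thresholds)
        expected = sum(k * pk for k, pk in enumerate(p))
        variance = sum((k - expected) ** 2 * pk for k, pk in enumerate(p))
        squared_residuals += (x - expected) ** 2
        model_variance += variance
    return squared_residuals / model_variance


# Hypothetical script rated by four raters on a 0-3 scale.
thresholds = [-1.0, 0.0, 1.0]
tendencies = [0.0, 0.0, 0.0, 0.0]
print(infit_mean_square([1, 2, 1, 2], tendencies, 0.0, thresholds))
print(infit_mean_square([1, 2, 1, 0], tendencies, 0.0, thresholds))
```

Values near 1 indicate ratings about as variable as the model expects; in the second call one rater departs further from the expected rating, and the mean square rises accordingly.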

This study has illustrated that conventional estimates of rater reliability lose much of the available 
information and lead the researcher into a paradox when inter-rater reliability is maximised. By reducing 
the variation among raters, the classical approach to reliability is jeopardised. Latent trait analyses 
provide item- and person-specific measures of reliability or error variance and these may be used to 
advantage in examining trends in the data. 